I first noticed this course on our doctoral student mailing list and thought it sounded interesting. I think Open Data and Open Science are both important aspects of good modern research, which is why I enrolled in this course to learn more about them. Additionally, I hadn’t used R or RStudio, let alone GitHub, before this course, so I thought I could also learn to use these helpful tools. So far the course has seemed very well organized; I’m looking forward to learning more!
I thought that the R for Health Data Science book together with Exercise Set 1 was an excellent way to introduce us to the basics of RStudio quickly and clearly. I especially liked the exercise set, because it gave a nice visual complement after first reading about the different tools and commands in the book. I don’t know if any particular topic was more difficult than the others; rather, the challenge was learning the logic of R and applying it to your own purposes.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Dec 12 19:16:15 2022"
Click here to see my GitHub repository.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Mon Dec 12 19:16:15 2022"
The data I’m using in my analysis is derived from the larger “JYTOPKYS2” dataset. The dataset was collected with a survey of statistics students and was used to evaluate the effects of learning approaches on exam results.
Seven variables from the original dataset were chosen for this data (gender, age, attitude, deep learning, strategic learning, surface learning and exam points). Additionally, students who scored 0 on the exam were excluded from this data.
The variable attitude in learning2014 is the sum of 10 questions related to students’ attitude towards statistics, each measured on the Likert scale (1-5). The variables deep (deep learning), stra (strategic learning) and surf (surface learning) in learning2014 are the mean values of the question combinations that relate to each learning approach, each question again measured on the Likert scale (1-5).
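As a sketch of how such scale variables are typically constructed in R (the item names and values below are made up purely for illustration; the actual JYTOPKYS2 question codes differ):

```r
# hypothetical Likert items (1-5); the real JYTOPKYS2 item codes differ
items <- data.frame(q1 = c(4, 2), q2 = c(5, 3), q3 = c(3, 1))
# an attitude-style variable is a sum over the items
sum_scale <- rowSums(items)    # 12 and 6
# deep/stra/surf-style variables are means over the items
mean_scale <- rowMeans(items)  # 4 and 2
```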
In the following R chunk I will explore the structure and dimensions of the data further.
# reading the data into memory
learning2014 <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/learning2014.csv", sep = ",", header = TRUE)
# displaying the first rows of the data frame (printing all 166 rows would flood the report)
head(learning2014)
## gender age attitude deep stra surf points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
## 6 F 38 38 4.750000 3.625 2.416667 21
# the dataset has 166 rows and 7 columns
dim(learning2014)
## [1] 166 7
# there are 7 variables and 166 observations. One variable is a character string, while the others are integer or numeric (decimal) variables
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# a summary of the variables
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :14.00 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333
## Mode :character Median :22.00 Median :32.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Here I present a graphical overview of the data and show summaries of the variables in the data.
# accessing the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# creating a plot matrix to give a graphical overview of the data
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# drawing the plot
p
# creating summaries of the variables
summary(factor(learning2014$gender))
## F M
## 110 56
summary(learning2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
summary(learning2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 26.00 32.00 31.43 37.00 50.00
summary(learning2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
summary(learning2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
summary(learning2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
summary(learning2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
The data in the plot matrix is divided by gender, with females depicted in red and males in green. Looking at the distributions of the variables in the graphical overview, you can quickly notice that the age variable is strongly skewed to the right. This is expected, as the study population consists of students. We also notice a slight left skew in deep learning (deep) and in the points variable. The attitude variable is slightly left-skewed in males, but almost normally distributed in females. Additionally, surface learning (surf) is right-skewed in males, but again closer to a normal distribution in females.
We can further appreciate the distributions and characteristics of individual variables by looking at the scatter and box plots of the plot matrix. For example, you could inspect if a certain variable has lots of outliers in the box plot or if spread is larger or different in the scatter plot.
The graphical overview also shows the correlation coefficients between variables. The most notable of these is the strong positive correlation between attitude and points. Also worth noting is the strong negative correlation between surface learning (surf) and both attitude and deep learning (deep) in male students, which is absent in female students. There is also a significant negative correlation between surface learning (surf) and strategic learning (stra) when the students are analyzed as a whole.
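The coefficients shown by ggpairs can also be computed directly with cor(). A self-contained illustration on toy vectors (with the actual data one would call, e.g., cor(learning2014$attitude, learning2014$points)):

```r
# Pearson correlation (the ggpairs default) computed directly on toy vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
round(cor(x, y), 3)
## [1] 0.775
```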
Looking at the summaries of the variables, you notice that the median age is quite young (median = 22 years) and that there are almost twice as many females as males (110 females to 56 males).
Here I create a regression model where exam points is the target variable.
# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# creating a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# creating new model without non-significant explanatory variable surf
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
My initial model had attitude, strategic learning and surface learning as
explanatory variables, as they seemed to have the largest correlation
coefficients with points in the plot matrix. Looking at the coefficients,
we notice positive estimates for attitude and strategic learning, while
surface learning has a negative estimate. The t-statistic is the
coefficient divided by its standard error: the further it is from 0, the
more likely it is that the coefficient isn’t 0 and that there is a
relationship between the variables. The t-value of attitude is
clearly above 0, so we can reject the null hypothesis (no relationship
between the target variable and the explanatory variable) and declare a
relationship between points and attitude. This is
supported by the small Pr(>|t|) of attitude, meaning there is a very
small chance of seeing such a relationship between points and
attitude due to chance alone. The same cannot be said about
strategic and surface learning. While strategic learning does have a
t-value above 0, the probability that this is due to chance is above the
cut-off point (p < 0.05), as stra has a Pr(>|t|) value
of 0.11716. Surface learning is even further from significance, as
surf has a t-value of -0.731 and a Pr(>|t|) value of 0.46563.
After the first model, I removed the surf variable from
the model since it didn’t have a significant relationship with the
target variable points. The new model was fitted with only
attitude and stra as explanatory variables.
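A term can also be dropped without retyping the whole formula by using update(); a self-contained sketch on the built-in mtcars data (standing in for learning2014 here):

```r
# fit a model with three explanatory variables, then drop one term
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- update(full, . ~ . - qsec)
names(coef(reduced))
## [1] "(Intercept)" "wt"          "hp"
```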
Next I use the summary of my new fitted model and explain the relationship between the chosen explanatory variables and the target variable.
# accessing the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
# my new model without the non-significant explanatory variable surf and its summary
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## stra 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
In my new model we notice that both attitude and strategic learning
still have positive estimates (0.34658 and 0.91365, respectively). This
means that for every point the attitude variable increases,
the exam points increase by 0.34658 on average (and by 0.91365 in the
case of stra). The t-statistic of attitude is clearly above 0
(t-value: 6.132), so we can declare a relationship between
points and attitude. This is supported by the
small Pr(>|t|) of attitude (Pr(>|t|): 6.31e-09), meaning there is a
very small chance of seeing such a relationship between points and
attitude due to chance alone. Even though the t-statistic of
stra is also clearly above 0 (t-value: 1.709) and its
Pr(>|t|) is smaller than before (Pr(>|t|): 0.08927), it still does not
reach the default cut-off point of p < 0.05. Therefore, we
cannot say that there is a significant relationship between
points and stra. Rather, there is suggestive
evidence of a potential relationship, but approximately an 8.9%
probability that it is due to chance.
The R-squared statistic provides a measure of how well the model fits
the data. A model with an R-squared value of 0 explains none of the
variance in exam points, while a model with an R-squared value of 1
explains all of it. The square of the multiple correlation coefficient
(multiple R-squared) of the model is 0.2048. As the multiple R-squared
value increases with each additional explanatory variable, it is
preferable to use the adjusted R-squared, which adjusts the value for
the number of variables considered. The adjusted R-squared is 0.1951,
i.e. these variables accounted for about 20% of the variation in exam
points. This is marginally more than in the previous model. The residual
standard error is a measure of the quality of the linear regression fit:
it is the average amount by which the target variable points
deviates from the true regression line. The smaller the residual
standard error, the better the model fits the dataset. In this model the
residual standard error is 5.289, which is slightly better than in the
first model. The omnibus F-test has a very small associated p-value
(7.734e-09), so there is very strong evidence that at least one of the
two regression coefficients is non-zero.
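The adjustment can be verified by hand from the formula adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of explanatory variables; a self-contained check on the built-in mtcars data:

```r
# verify the adjusted R-squared formula against summary.lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)
n   <- nrow(mtcars)  # 32 observations
p   <- 2             # two explanatory variables
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_by_hand, s$adj.r.squared)
## [1] TRUE
```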
Finally I produce the following diagnostic plots of my model: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
# divides the window into a 2x2 grid
par(mfrow = c(2,2))
# draws the diagnostic plots using the plot() function. The previously specified plots 1, 2 and 5 are selected
plot(my_model2, c(1,2,5))
Linear regression has four main assumptions: linear relationship, independence, homoscedasticity and normality.
We can use the residuals vs. fitted plot to evaluate whether there is correlation between the residuals and the fitted values. As we can see, the data points are distributed fairly randomly around the 0 line, which supports the assumptions that the relationship is linear and the data is reasonably homoscedastic. There are a few outliers with large negative residuals, but these don’t appear to alter the overall variance in the data.
Using the normal Q-Q plot, we can evaluate the skewness of the data and the fit of the model. My model seems to be slightly left-skewed, but fairly normally distributed overall.
Finally, we can use the residuals vs. leverage plot to assess linearity and influential points. Influential points are observations that, when removed from the dataset, could change the coefficients of the refitted model. Here we can see that the spread is nicely linear, and no influential points are observed.
date()
## [1] "Mon Dec 12 19:16:18 2022"
The data we are using in this assignment was acquired from the UCI Machine Learning Repository, Student Performance Data (incl. Alcohol consumption) page and later modified.
#install required packages if needed
#install.packages("tidyverse")
#install.packages("boot")
#install.packages("tidyr")
#install.packages("readr")
#install.packages("dplyr")
#install.packages("patchwork")
# access the readr and dplyr library
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#read the 'alc' data set into memory
alc <- read_csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/alc.csv", show_col_types=FALSE)
# look at the column names/names of the variables of the data set
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data was collected from students in secondary education at two
Portuguese schools. The data attributes include student grades and
demographic, social and school-related features, and it was collected
using school reports and questionnaires. As we can see, there are 35
variables in our data set. Our focus in this assignment is to evaluate
the possible relationship between high/low alcohol use
(high_use) and other variables.
I chose the student’s sex (sex), going out with friends
(goout), number of past class failures
(failures) and number of school absences
(absences) as my variables of interest. My hypothesis is
that male students have on average higher alcohol consumption than
female students, as males also have higher alcohol consumption in the
general population. I also hypothesize that students with more absences
and class failures have higher alcohol consumption on average, as people
with excessive alcohol use might miss school or work due to hangovers
etc. Additionally, I hypothesize that going out with friends more often
increases alcohol consumption, as alcohol tends to be a frequent
component when young adults socialize.
# access tidyr library
library(tidyr)
# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_going_out = mean(goout), mean_failures = mean(failures), mean_absences = mean(absences))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 6
## # Groups: sex [2]
## sex high_use count mean_going_out mean_failures mean_absences
## <chr> <lgl> <int> <dbl> <dbl> <dbl>
## 1 F FALSE 154 2.95 0.104 4.25
## 2 F TRUE 41 3.39 0.293 6.85
## 3 M FALSE 105 2.70 0.143 2.91
## 4 M TRUE 70 3.93 0.386 6.1
Looking at our summarised table, we can see that considerably more males than females have high alcohol consumption (70 and 41, respectively), even though the total numbers of males and females are almost equal (175 and 195, respectively). We can also note that students with high alcohol consumption go out with their friends more often than students without high alcohol consumption, regardless of sex. Additionally, the average numbers of class failures and school absences are higher for students with high alcohol consumption than for others, again regardless of sex.
Next we will take a closer look at the distributions of failures and goout.
# access ggplot2 and patchwork libraries
library(ggplot2)
library(patchwork)
theme_set(theme_classic())
# closer look at the distribution of failures and goout
alc %>% count(failures)
## # A tibble: 4 × 2
## failures n
## <dbl> <int>
## 1 0 325
## 2 1 24
## 3 2 17
## 4 3 4
alc %>% count(goout)
## # A tibble: 5 × 2
## goout n
## <dbl> <int>
## 1 1 22
## 2 2 97
## 3 3 120
## 4 4 78
## 5 5 53
# create histograms for failures and goout
h1 <- alc %>% ggplot(aes(x = failures)) + geom_histogram(binwidth = 1) + scale_fill_grey()
h2 <- alc %>% ggplot(aes(x = goout)) + geom_histogram(binwidth = 1) + scale_fill_grey()
# draw histograms of failures and goout
h1 + h2
As we can see from the tables and histograms, class failures are heavily right-skewed: the vast majority of students haven’t failed any class and only a handful have failed at least one (325 and 45, respectively). Going out with friends is almost normally distributed, although only a few students go out with their friends very rarely.
Next we will look at the chosen variables together with their relationship with alcohol consumption.
# define the plot as a bar plot
g1 <- alc %>% ggplot(aes(x = sex, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()
# define the plot as a bar plot
g2 <- alc %>% ggplot(aes(x = failures, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + theme(legend.position = "none") + scale_fill_grey()
# define the plot as a box plot
g3 <- alc %>% ggplot(aes(x = high_use, y = absences)) + geom_boxplot() + xlab("high alcohol consumption")
# define the plot as a bar plot
g4 <- alc %>% ggplot(aes(x = goout, fill = high_use)) + geom_bar(position = "fill") + ylab("proportion") + xlab("going out with friends") + scale_fill_grey(name = "high alcohol consumption")
# draw the plots
g1 + g2 + g3 + g4
Looking first at the sex variable, we notice that a higher proportion of males have high alcohol consumption compared to females, as we also noted previously.
Next looking at the failures variable, we notice that
the proportion of students with high alcohol consumption increases with
the number of failed classes, which suggests a positive correlation
between them.
Next, looking at the absences variable, we notice that both the median and the spread of absences seem to be larger for students with high alcohol consumption. This might also point to a positive correlation between them.
Finally, looking at the goout variable, we notice that the proportion of students with high alcohol consumption increases with how often students go out with friends, similarly to the pattern seen with the failures variable. This also suggests a positive correlation between them.
The findings in these plots seem to support my initial hypothesis regarding these chosen variables.
Next I created a logistic regression model to predict high/low
alcohol consumption. The target variable is high_use and the
explanatory variables are failures, absences,
sex and goout. Here I provide a summary of my
model and interpret the coefficients in my model as odds ratios.
# create logistic regression model using selected variables
m <- glm(high_use ~ failures + absences + sex + goout, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1521 -0.7929 -0.5317 0.7412 2.3706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## absences 0.08246 0.02266 3.639 0.000274 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR) for the coefficients of the model by exponentiation
OR <- coef(m) %>% exp
# compute confidence intervals (CI) for the odds ratios by exponentiation
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01586098 0.005950831 0.03905075
## failures 1.63121360 1.042981792 2.59042892
## absences 1.08595465 1.039999394 1.13814475
## sexM 2.66415000 1.605059602 4.48576247
## goout 2.00975350 1.594527456 2.56461657
All the explanatory variables have a positive regression coefficient,
which means that the variables increase the probability of high
alcohol consumption. The z-value is the regression coefficient
divided by its standard error; if it is clearly above or below 0,
the variable is likely significant. Usually the cut-off value used
for significance is 2.0 (which roughly corresponds to p < 0.05),
and this is met by all my chosen variables. We can see that
failures has a significant relationship with high_use, and the
other chosen variables have very strongly significant relationships
with high_use.
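The z-value definition above can be checked directly from the coefficient table of any glm fit; a self-contained sketch using a logistic model on the built-in mtcars data:

```r
# the z value column is Estimate divided by Std. Error
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")
ct <- summary(m_toy)$coefficients
all.equal(ct[, "z value"], ct[, "Estimate"] / ct[, "Std. Error"])
## [1] TRUE
```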
Looking at the odds ratios of my explanatory variables, we see that
there is a strong positive association between the variables
goout and sex and high_use, as
both the OR and the CI are clearly above 1.0. In addition, the variables
failures and absences also have a positive
association with high_use, although their increase in the
odds of high alcohol consumption is not as large as with
goout and sex.
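For a single binary predictor, the odds ratio obtained by exponentiating the logistic coefficient matches the classic cross-product ratio of the 2×2 table; a self-contained check on made-up counts:

```r
# made-up counts: 40/10 outcome split when x = 0, 20/30 when x = 1
d <- data.frame(x = rep(c(0, 0, 1, 1), times = c(40, 10, 20, 30)),
                y = rep(c(0, 1, 0, 1), times = c(40, 10, 20, 30)))
fit <- glm(y ~ x, data = d, family = "binomial")
exp(coef(fit)["x"])    # odds ratio from the model
(30 * 40) / (20 * 10)  # cross-product ratio of the 2x2 table, exactly 6
```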
Next we explore the predictive power of my model.
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 243 16
## TRUE 61 50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65675676 0.04324324 0.70000000
## TRUE 0.16486486 0.13513514 0.30000000
## Sum 0.82162162 0.17837838 1.00000000
# computing the total proportion of inaccurately classified individuals ( = training error)
(16+61)/370
## [1] 0.2081081
From the cross tabulation we can see that our model is fairly good at predicting low alcohol consumption, but not as good when consumption is high (243/259 and 50/111, respectively). This is made clearer by the point plot.
The training error of my model was about 0.21.
If a person simply guessed at random whether the alcohol consumption of a student is high or low, the training error of this guessing would approach 0.50 (flipping a coin). This would mean my model classifies roughly 30 percentage points more of the students correctly than simple guessing.
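To make the comparison concrete, here is a small sketch comparing a coin-flip guesser and an "always low" guesser. It uses simulated labels with roughly the same 30/70 split as the data, not the actual alc data:

```r
# Sketch: training error of naive guessing strategies, on simulated
# labels with roughly the same 30/70 split of high_use as the data.
set.seed(42)
high_use  <- rbinom(370, size = 1, prob = 0.30)  # ~30% heavy users
coin_flip <- rbinom(370, size = 1, prob = 0.50)  # random guessing
mean(coin_flip != high_use)  # error close to 0.50
mean(0 != high_use)          # "guess low for everyone": error close to 0.30
```

Note that always guessing "low" already gives a training error of about 0.30, so that is arguably the tougher baseline for the model's 0.21 to beat.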
As an additional task we were asked to perform a 10-fold cross validation of our models.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2108108
As we can see, my model has a prediction error of about 0.21, which is lower (better) than that of the model introduced in Exercise Set 3 (prediction error ≈ 0.26).
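Since the folds are drawn at random, a single 10-fold CV estimate varies from run to run; repeating the cross-validation and averaging gives a steadier number. A minimal sketch of the idea, fitted on the built-in mtcars data rather than the alc data used above:

```r
# Sketch: average several 10-fold CV runs, since a single K-fold split
# is random. Illustrated on mtcars, not the alc data.
library(boot)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")
set.seed(123)
errors <- replicate(20, cv.glm(mtcars, cost = loss_func,
                               glmfit = fit, K = 10)$delta[1])
round(mean(errors), 3)  # a more stable estimate of the prediction error
```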
date()
## [1] "Mon Dec 12 19:16:20 2022"
# install.packages(c("MASS", "corrplot", "tidyverse", "reshape2")) if needed
In this assignment we look at the Boston data set from
the MASS library. It describes housing values in suburbs of
Boston and various other factors relating to the housing and its location.
The data frame has 14 variables and 506 observations (14 columns and 506
rows). Here I show the structure of the data frame and a short summary
of the variables in question.
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:patchwork':
##
## area
## The following object is masked from 'package:dplyr':
##
## select
# load the Boston data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Here I visualize the distribution of the variables and the relationships between them.
# access the required libraries
library(tidyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
# convert data from wide to long
melt.boston <- melt(Boston)
## No id variables; using all as measure variables
# visualize variables with small multiple chart
g1 <- ggplot(data = melt.boston, aes(x = value)) +
stat_density() +
facet_wrap(~variable, scales = "free")
g1
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(2)
# test correlations for significance (cor.mtest() expects the data, not the correlation matrix)
testRes = cor.mtest(Boston, conf.level = 0.95)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.col = "black", lower.col = "black", number.cex = 0.7, tl.cex=0.7)
Looking at the previous summary of the variables and visualization
afterwards, we can get a good understanding of the data. We can see that
crime rate crim, proportion of residential lands zoned
zn, distance to employment centers dis and
lower status of the population lstat are all clearly skewed
right. On the other side, proportion of blacks black,
proportion of units built before 1940 age and pupil-teacher
ratio ptratio are all clearly skewed left. Porportion of
non-retail business acres indus, accessability to highways
rad and property-tax rate tax have bimodal
distributions (Charles River dummy variable char is a
binary variable). Looking at the correlation matrix, we can see that the
highest positive correlations are between tax and
rad, indus and nox (nitrogen
oxide concentration), age and nox, and
rm (average number of rooms) and medv (median
value). We can also see that the highest negative correlations are
between dis and nox, dis and
age, medv and lstat, and
dis and indus.
In order to standardize our data, we scale it: each variable is centered to its mean and divided by its standard deviation.
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# print summary of scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
After scaling the data we can see that the mean of every variable is 0; each variable has also been rescaled to a standard deviation of 1, so all variables are on a comparable scale regardless of their original units.
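scale() simply computes (x - mean(x)) / sd(x) column by column, which can be checked by hand on a small vector:

```r
# scale() subtracts the column mean and divides by the column standard
# deviation; verify against a manual computation on one small vector.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
c(mean = mean(manual), sd = sd(manual))  # mean 0, sd 1
```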
Next we create a categorical variable of the crime rate in the scaled data set. After this, we divide the data set into train and test sets (80% of the data goes to the train set).
#access the dplyr library
library(dplyr)
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
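Because sample() draws the rows at random, the split changes between runs unless a seed is set. The split can be sanity-checked like this (a sketch on bare row indices matching the Boston dimensions, not the data frame itself):

```r
# Sketch: check that an 80/20 split has the expected sizes and that the
# sampled training rows contain no duplicates (sampling w/o replacement).
set.seed(2022)
n <- 506                                 # number of rows in Boston
ind <- sample(n, size = floor(n * 0.8))
length(ind)              # 404 training rows
n - length(ind)          # 102 test rows
anyDuplicated(ind) == 0  # TRUE: no row appears twice
```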
Here we fit a linear discriminant analysis (LDA) on the train set,
using the categorical crime rate crime as the target
variable and all the other variables in the data set as predictor
variables. We then visualize this analysis with a biplot.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
## Predicting classes using the LDA model
Next we first save the crime categories from the test set and then remove the categorical crime variable from the test data set. After this, we predict the classes with the LDA model on the test data. Finally, we cross tabulate the results with the crime categories saved from the test set.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 13 2 0
## med_low 3 11 5 0
## med_high 1 11 12 1
## high 0 0 1 27
Looking at the cross tabulation we can see that the overall prediction rate of our model is approximately 64% (correct predictions divided by all predictions, 65/102), which is fairly good. We notice that our model is really good at predicting the high crime rate category, while not as good at predicting the other categories.
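The overall prediction rate is just the diagonal of the confusion matrix divided by its total, computed here from the cross tabulation printed above:

```r
# Overall prediction rate = correct predictions / all predictions,
# using the cross tabulation printed above.
conf_tab <- matrix(c(15, 13,  2,  0,
                      3, 11,  5,  0,
                      1, 11, 12,  1,
                      0,  0,  1, 27),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                   predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
round(accuracy, 3)  # 0.637
```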
Next we reload the Boston data set and standardize it.
Then we calculate the distance between observations.
library(MASS)
data("Boston")
# center and standardize variables as done previously
boston_scaled2 <- scale(Boston)
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled2)
# create the euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
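By default dist() uses the Euclidean metric, sqrt(sum((x - y)^2)), which a two-point example confirms:

```r
# dist() defaults to Euclidean distance: sqrt(sum((x_i - y_i)^2))
pts <- rbind(a = c(0, 0), b = c(3, 4))
as.numeric(dist(pts))                   # 5 (the classic 3-4-5 triangle)
sqrt(sum((pts["a", ] - pts["b", ])^2))  # 5, computed by hand
```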
Next we run the k-means algorithm on the data set and after finding the optimal number of clusters we visualize them.
# Use seed to mitigate the effect of random number generators
set.seed(1234)
# k-means clustering (on the scaled data, so distances are comparable across variables)
km <- kmeans(boston_scaled2, centers = 3)
# plot the scaled data set with clusters
pairs(boston_scaled2, col = km$cluster)
# determine the number of clusters for optimization
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
# k-means clustering with optimal number of clusters
km2 <- kmeans(boston_scaled2, centers = 2)
# plot the scaled dataset with clusters
pairs(boston_scaled2, col = km2$cluster)
The k-means algorithm was initially run with three clusters. Visualizing how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes, we can see that the total WCSS drops radically at two clusters, meaning this is our optimal number of clusters. We then performed the k-means algorithm again with the optimal number of clusters and visualized it. Looking at our plot with two clusters, we can see that overall the clusters are fairly well separated within the variables, meaning the clustering seems to work properly. Comparing the final plot to the original k-means run with 3 clusters, we can see that the latter has more overlap between clusters and is therefore harder to interpret.
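The elbow heuristic used above is easy to reproduce on toy data with two obvious clusters: the total WCSS collapses once k reaches the true number of clusters and flattens afterwards. A small sketch on simulated data, not the Boston set:

```r
# Sketch of the elbow heuristic: with two clearly separated clusters,
# total within-cluster sum of squares (WCSS) drops sharply until k = 2.
set.seed(1234)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),  # cluster around (0, 0)
             matrix(rnorm(100, mean = 6), ncol = 2))  # cluster around (6, 6)
twcss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
round(twcss)  # the big drop happens between k = 1 and k = 2
```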
date()
## [1] "Mon Dec 12 19:16:26 2022"
Here we look at the variables of the human data that we
prepared previously.
#read the 'human' data set into memory
human <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/human.csv", row.names = 1)
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
#access GGally library
library(GGally)
#visualize the 'human' variables
ggpairs(human, progress = FALSE)
#access corrplot library
library(corrplot)
#compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
As we look at the summaries of the variables and the pairs plot, we
can see that Gross National Income per capita (GNI),
maternal mortality ratio (Mat.Mor) and adolescent birth
rate (Ado.Birth) all have heavily right-skewed
distributions. On the other hand, both Life expectancy
(Life.Exp) and ratio of labor force participation of
females and males (Labo.FM) have left-skewed distributions.
The distribution of the ratio of female to male populations with
secondary education (Edu2.FM) is sharply peaked (positive kurtosis).
Looking at the pairs plot and correlation plot we notice that there
are a few significant correlations between our variables. There is a
high positive correlation between Ado.Birth and
Mat.Mor (0.759). There is also a high negative correlation
between Life.Exp and both Mat.Mor (-0.857) and
Ado.Birth (-0.729). On the other hand, there is a high
positive correlation between Life.Exp and expected years of
schooling (Edu.Exp) (0.789). Lastly, there is a positive
correlation between Gross National Income per capita (GNI) and both
Life.Exp (0.627) and Edu.Exp (0.624).
Next we perform the principal component analysis (PCA) on the raw
(non-standardized) human data.
#perform principal component analysis (with the singular value decomposition method)
pca_human <- prcomp(human)
#create a summary of pca_human
s <- summary(pca_human)
#rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 2)
#print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 99.99 0.01 0.00 0.00 0.00 0.00 0.00 0.00
#create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
#draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
PCA with raw data: PC1 explains almost all of the variance. GNI has a heavy negative loading on PC1, while the other variables do not seem to have a significant loading on either PC.
Next we standardize the variables and repeat the above analysis.
#standardize the variables
human_std <- scale(human)
#perform principal component analysis on the standardized variables (with the SVD method)
pca_human_std <- prcomp(human_std)
#create and print out a summary of pca_human_std
s2 <- summary(pca_human_std)
#rounded percentanges of variance captured by each PC
pca_pr2 <- round(100*s2$importance[2, ], digits = 2)
#print out the percentages of variance
pca_pr2
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.61 16.24 9.57 7.58 5.48 3.60 2.63 1.30
# create object pc_lab to be used as axis labels
pc_lab2 <- paste0(names(pca_pr2), " (", pca_pr2, "%)")
#draw a biplot of the principal component representation and the standardized variables
biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("navyblue", "firebrick1"), xlab = pc_lab2[1], ylab = pc_lab2[2])
PCA with standardized data: PC1 explains 53.61% of the variance, while PC2 explains 16.24%. Mat.Mor and Ado.Birth both have positive loadings on PC1, while Edu2.FM, GNI, Edu.Exp and Life.Exp have negative loadings on PC1. Parli.F and Labo.FM both have positive loadings on PC2. No variables seem to have a significant negative loading on PC2.
As we can see, the results of the two PCAs are very different. In
the first PCA using the non-standardized data, the first
principal component (PC1) explains almost all of the variance (99.99%)
and the second principal component (PC2) explains only 0.01%. After
standardization, PC1 explains 53.61% and PC2 explains 16.24% of the
variance. I believe this difference is due to the different scaling of
the raw variables. Looking at the previous section, we can see that
GNI has a much larger scale compared to the other variables.
This is evident in the first biplot, where the only visible arrow is
GNI, showing that the first PCA has loaded heavily on the
GNI variable. Once the variables are standardized, the
scales are very similar between variables and the PCA is no longer
dominated by differences in variable scales.
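The scale effect is easy to reproduce: give one simulated variable a much larger variance and unscaled PCA loads almost entirely on it, while standardized PCA does not. A sketch mirroring the GNI situation, not the human data itself:

```r
# Sketch: one large-scale variable dominates PCA on raw data but not
# after standardization (mirroring what GNI does in the human data).
set.seed(1)
d <- data.frame(small1 = rnorm(100), small2 = rnorm(100),
                big = rnorm(100, sd = 1000))            # huge scale
var_raw <- summary(prcomp(d))$importance[2, 1]          # PC1 share, raw
var_std <- summary(prcomp(scale(d)))$importance[2, 1]   # PC1 share, standardized
round(c(raw = var_raw, standardized = var_std), 3)
```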
My own interpretation of the first two principal component dimensions is that PC1 relates to how developed a country is. The reasoning comes from the variables loading on PC1: Mat.Mor, Ado.Birth, Edu2.FM, GNI, Edu.Exp and Life.Exp, all of which are key factors when assessing a country's stage of development. This is further supported by the biplot, where richer and more developed countries (e.g. many European nations) are on the left side while many poorer countries (e.g. many African nations) are on the right side. PC2 seems to relate to female participation in society, as the variables loading on it are Parli.F and Labo.FM. We can also notice that the countries furthest from these variables are fairly male-centric societies (e.g. several Arab nations).
Next we will look at tea data. The tea data comes from
the FactoMineR package and was collected with a questionnaire on tea:
300 individuals were asked how they drink tea (18 questions) and what
their perception of the product is (12 questions). In addition, some
personal details were asked (4 questions). We will look at the
dimensions and structure of the data and finally visualize it.
#read tea data set into memory and convert character variables to factors
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
#look at the dimension and structure of the data
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#browse the contents of the data frame
View(tea)
#access dplyr library
library(dplyr)
#selecting columns to keep in the data set for visualization and MCA
keep_columns <- c("Tea", "How", "sugar", "how", "sex", "age_Q", "frequency")
#select the 'keep_columns' to create a new data set; all_of() avoids a
#tidyselect deprecation warning about passing an external vector directly
tea_time <- dplyr::select(tea, all_of(keep_columns))
#access the ggplot and tidyr library
library(ggplot2)
library(tidyr)
#visualize the data set
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) + facet_wrap("name", scales = "free")
Looking at the structure of the tea data frame, we see
that there are 36 different variables and 300 observations. Almost all
of the variables are factors except age which is an
integer. For the visualization and upcoming analysis I chose seven
variables that seemed interesting to me. From the visualization we can
see that Earl Grey is the most popular tea variety. Most people drink
their tea alone (without any additions). Most people drink tea at least
once per day. Most people use tea bags. Use of sugar is fairly evenly
split. There were slightly more females than males in the
questionnaire, and the largest age group among the respondents was
ages 15 to 24.
Next we perform multiple correspondence analysis (MCA) on the selected variables.
#multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
#summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.246 0.216 0.186 0.179 0.168 0.159 0.149
## % of var. 10.752 9.435 8.137 7.841 7.347 6.947 6.534
## Cumulative % of var. 10.752 20.187 28.324 36.165 43.512 50.459 56.993
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.145 0.130 0.123 0.122 0.113 0.101 0.095
## % of var. 6.358 5.701 5.396 5.358 4.933 4.419 4.171
## Cumulative % of var. 63.351 69.052 74.448 79.807 84.740 89.158 93.329
## Dim.15 Dim.16
## Variance 0.078 0.074
## % of var. 3.430 3.241
## Cumulative % of var. 96.759 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr cos2
## 1 | 0.015 0.000 0.000 | 0.558 0.481 0.140 | 0.396 0.282 0.071
## 2 | 0.611 0.507 0.171 | 0.049 0.004 0.001 | 0.020 0.001 0.000
## 3 | 0.366 0.181 0.107 | -0.559 0.483 0.250 | -0.331 0.197 0.088
## 4 | -0.752 0.768 0.450 | 0.019 0.001 0.000 | 0.147 0.039 0.017
## 5 | 0.241 0.079 0.043 | -0.109 0.018 0.009 | -0.326 0.190 0.078
## 6 | -0.376 0.192 0.114 | -0.123 0.023 0.012 | -0.073 0.009 0.004
## 7 | 0.096 0.012 0.003 | 0.144 0.032 0.008 | 0.082 0.012 0.003
## 8 | 0.449 0.273 0.065 | 0.155 0.037 0.008 | 0.664 0.791 0.143
## 9 | 0.257 0.089 0.028 | 0.014 0.000 0.000 | 0.050 0.004 0.001
## 10 | 0.675 0.619 0.199 | 0.096 0.014 0.004 | 0.001 0.000 0.000
##
## 1 |
## 2 |
## 3 |
## 4 |
## 5 |
## 6 |
## 7 |
## 8 |
## 9 |
## 10 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | 1.056 15.999 0.365 10.452 | 0.459 3.448 0.069 4.545 |
## Earl Grey | -0.457 7.812 0.377 -10.614 | -0.293 3.658 0.155 -6.804 |
## green | 0.304 0.592 0.011 1.850 | 0.684 3.405 0.058 4.155 |
## alone | -0.035 0.047 0.002 -0.831 | -0.180 1.389 0.060 -4.232 |
## lemon | -0.330 0.698 0.013 -2.009 | 0.487 1.725 0.029 2.958 |
## milk | 0.025 0.008 0.000 0.227 | 0.306 1.306 0.025 2.731 |
## other | 1.798 5.635 0.100 5.466 | -0.037 0.003 0.000 -0.114 |
## No.sugar | 0.631 11.946 0.425 11.275 | -0.223 1.704 0.053 -3.990 |
## sugar | -0.674 12.770 0.425 -11.275 | 0.239 1.822 0.053 3.990 |
## tea bag | -0.192 1.218 0.048 -3.802 | -0.036 0.048 0.002 -0.708 |
## Dim.3 ctr cos2 v.test
## black 0.537 5.456 0.094 5.310 |
## Earl Grey -0.010 0.005 0.000 -0.225 |
## green -1.147 11.110 0.163 -6.971 |
## alone -0.318 5.037 0.187 -7.485 |
## lemon 0.412 1.434 0.021 2.505 |
## milk 0.375 2.267 0.037 3.342 |
## other 2.747 17.383 0.233 8.352 |
## No.sugar -0.320 4.076 0.110 -5.730 |
## sugar 0.343 4.357 0.110 5.730 |
## tea bag 0.334 4.859 0.146 6.607 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.420 0.159 0.216 |
## How | 0.110 0.067 0.340 |
## sugar | 0.425 0.053 0.110 |
## how | 0.084 0.171 0.150 |
## sex | 0.045 0.517 0.000 |
## age_Q | 0.478 0.403 0.421 |
## frequency | 0.159 0.140 0.065 |
#visualize MCA with the variable biplot
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
Looking at the variable biplot, we can see that the first two dimensions of the MCA capture 10.75% and 9.44% of the total variance. The horizontal dimension separates the "other" tea drinkers from the remaining options. The vertical dimension opposes male and female, as well as unpackaged tea versus the other packaging options, and it also separates Earl Grey from the other tea varieties. The age groups are linked to both dimensions. Looking at how the categories are spread, we might cautiously infer that older men are more likely than young women to drink unpackaged green or black tea.
date()
## [1] "Mon Dec 12 19:16:36 2022"
First we look at the long-form RATS (RATSL) data. It is derived from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal's body weight (grams) was recorded repeatedly, approximately weekly (except in week seven, when two recordings were taken), over a 9-week period. The question of most interest is whether the growth profiles of the three groups differ.
RATSL <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/RATSL.csv", sep = ",", header = TRUE)
#Convert ID and Group variables into factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
#Check that the variables and structure are correct
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
As we can see, there are 176 observations and 5 variables. ID is the identification number of the test subject (16 in total). Group indicates which of the three diets the test subject is on. The time of weighing is recorded both as the character variable WD and as the integer variable Time (11 measurement points in total). Finally, the Weight variable gives the weight of the test subject on the given day.
Next we will look at the RATSL data using a graphical display. The plot will display individual response profiles by diet group for each test subject.
#Access the package ggplot2
library(ggplot2)
#Set theme for plots in this chapter
theme_set(theme_classic())
#Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
As we can see from the plot, the weight of almost all the rats increases during the 9-week period. This weight gain seems to be more pronounced in groups 2 and 3 than in group 1. Additionally, the rats that had lower or higher weights at the beginning usually also had lower or higher weights at the end of the study; this phenomenon is generally referred to as tracking. There do not seem to be large individual differences between test subjects within the diet groups.
Next we will look more closely at the individual responses using standardized values, which should give us a better look at the tracking phenomenon.
#Access the dplyr and tidyr packages
library(dplyr)
library(tidyr)
#Standardize the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
#Plot again with the standardised Weight
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(name = "standardized Weight")
We can see that there is some change in individual weights, but generally the heavier test subjects stayed heavy throughout the testing period, and the lighter ones stayed light.
As graphs of individual responses can be hard to interpret with large numbers of observations, we next create a summary graph using mean response profiles to better illustrate the differences between the diet groups.
#Summary data with mean and standard error of Weight by Group and Time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = (sd(Weight)/sqrt(length(Weight)))) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) + #constant linetype set outside aes() to avoid a spurious legend entry
theme(legend.position = c(0.9,0.4)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
We can see that there is no overlap between the diet groups and that their mean response profiles run fairly parallel throughout the plot. As previously suggested, group 1 shows less of an increase in mean weight than groups 2 and 3. Since the profiles are roughly parallel, however, the rate of weight gain seems similar across the three diet groups even though their weight levels clearly differ.
Next we will use the summary measure approach to evaluate the response of each diet group over time, using the mean weight over days 8 to 64 as the summary measure.
#Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time 1)
RATSL10S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
#Draw a boxplot of the mean versus Group
ggplot(RATSL10S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Time 8-64")
#Remove the one clear outlier
RATSL10S1 <- RATSL10S %>% filter(mean < 550)
#Redraw the previous plot without the outlier
ggplot(RATSL10S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Time 8-64")
Looking at the first box plot, we notice that the mean summary measure is more variable in group 2 than in the other diet groups, and that the distribution of group 3 is slightly skewed. Group 2 is also clearly skewed and contains an outlier, a test subject weighing almost 600 grams. As it might bias the conclusions of further group comparisons, we remove it from the data. Note that groups 1 and 3 also contain outliers, but as these are less pronounced we keep them in their groups.
Without the outlier the box plot is easier to interpret: the variance of group 2 is now very small, and its mean weight is more clearly lower than that of group 3.
The plots suggest a significant difference in mean weight between the groups, which we can assess with formal statistical tests. We will also use the pre-diet weight as a covariate in an analysis of covariance to see whether the diets affect weight independently of the baseline values.
#Create model to use in ANOVA
one.way <- lm(mean ~ Group, RATSL10S1)
#Compute the ANOVA table for the model
anova(one.way)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 207659 103830 501.81 2.721e-12 ***
## Residuals 12 2483 207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform a two-sample t-test for Groups 1 and 2
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(1,2), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -25.399, df = 9, p-value = 1.094e-09
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -204.0633 -170.6867
## sample estimates:
## mean in group 1 mean in group 2
## 265.025 452.400
#Perform a two-sample t-test for Groups 2 and 3
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(2,3), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -5.632, df = 5, p-value = 0.002446
## alternative hypothesis: true difference in means between group 2 and group 3 is not equal to 0
## 95 percent confidence interval:
## -109.37769 -40.82231
## sample estimates:
## mean in group 2 mean in group 3
## 452.4 527.5
#Perform a two-sample t-test for Groups 1 and 3
t.test(mean ~ Group, data = RATSL10S1, subset = Group %in% c(1,3), var.equal = TRUE)
##
## Two Sample t-test
##
## data: mean by Group
## t = -27.824, df = 10, p-value = 8.345e-11
## alternative hypothesis: true difference in means between group 1 and group 3 is not equal to 0
## 95 percent confidence interval:
## -283.4943 -241.4557
## sample estimates:
## mean in group 1 mean in group 3
## 265.025 527.500
#Read the original RATS data into memory for the next part
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
#Add the baseline from the original data as a new variable to the summary data
RATSL10S2 <- RATSL10S %>%
mutate(baseline = RATS$WD1)
#Fit the linear model with the mean as the response
fit <- lm(mean ~ baseline + Group, data = RATSL10S2)
#Compute the ANOVA table for the fitted model
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table shows a significant difference in mean weight between the diet groups. Examining the groups pairwise with two-sample t-tests, we see that the difference in mean weight is significant between every pair of diet groups. This is expected, as the difference in weights between the groups was clearly visible from the start.
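Because we ran three separate pairwise t-tests, it is worth checking that the conclusions survive a multiple-comparison adjustment. A minimal sketch, assuming the RATSL10S1 data created above:

```r
# All pairwise comparisons of mean weight between diet groups,
# with Bonferroni-adjusted p-values (a conservative correction)
pairwise.t.test(RATSL10S1$mean, RATSL10S1$Group,
                p.adjust.method = "bonferroni")
```

With p-values as small as those above, the pairwise differences remain significant even after the adjustment.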
Looking at the ANCOVA table with the pre-diet weights included, we see that the baseline weight is strongly related to the weights recorded during the diets. After conditioning on the baseline there is only weak evidence of a diet effect (p = 0.076), which does not reach the conventional 0.05 significance level.
In the second part of our assignment we look at the long-form BPRS data (BPRSL). It is derived from the BPRS data, in which 40 male subjects were randomly assigned to one of two treatment groups, and each subject was rated on the Brief Psychiatric Rating Scale (BPRS) before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity, each rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.
BPRSL <- read.csv("~/Documents/Tohtoritutkinto/Open Science/IODS-project/data/BPRSL.csv", sep = ",", header = TRUE)
#Convert treatment and subject variables into factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
#Check that variables and structure is correct
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
Looking at the output, we see there are 360 observations and 5 variables. The variable ‘treatment’ indicates which of the two treatments the subject received. ‘subject’ is the identification number of the subject (40 subjects in total). The time point of each BPRS rating is given both as the character variable ‘weeks’ and as the integer variable ‘week’ (9 time points in total). Finally, the ‘bprs’ variable gives the BPRS rating of the subject in the given week.
Next we plot the BPRSL data to visually evaluate if there is a difference between treatment groups.
#Give individual subjects unique identification numbers
BPRSL$subject <- as.numeric(BPRSL$subject)
BPRSL <- BPRSL %>% mutate(subject = case_when(treatment == 2 ~ subject + 20, TRUE ~ as.numeric(subject)))
#Convert subject variable back into factors
BPRSL$subject <- factor(BPRSL$subject)
#Plot the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line() + aes(linetype = treatment) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) + scale_y_continuous(name = "BPRS") + theme(legend.position = "top")
Initially it is hard to see a clear difference between the two treatment groups, although the subjects in treatment 2 seem to account for more of the higher BPRS ratings at the end of the 8-week period than the subjects in treatment 1.
Ignoring the repeated-measures structure of our data, we then create a regression model for the BPRSL data.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Looking at the results of our model we can see that the regression on the week variable is highly significant. However, the treatment groups do not differ significantly from each other. As this form of analysis is not appropriate for our repeated-measures data we then move on to consider some more appropriate models.
Next we create the random intercept model as a more suitable model for the BPRSL data. We use ‘week’ and ‘treatment’ as explanatory variables.
#Access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
#Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
#Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
The estimated regression parameters for week and treatment are quite similar to those from the previous independence model, but their standard errors change once the within-subject correlation is taken into account: the standard error of week decreases while that of treatment increases.
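One way to see what the random intercept changes is to put the estimates and standard errors of the two models side by side; a sketch assuming the BPRS_reg and BPRS_ref objects fitted above:

```r
# Fixed-effect estimates and standard errors from the independence
# model (lm) and the random intercept model (lmer), side by side
round(cbind(
  lm_est   = coef(BPRS_reg),
  lm_se    = coef(summary(BPRS_reg))[, "Std. Error"],
  lmer_est = lme4::fixef(BPRS_ref),
  lmer_se  = coef(summary(BPRS_ref))[, "Std. Error"]
), 3)
```

The point estimates agree, but the standard errors differ between the two models.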
Still looking for better models to represent the data, we create a random intercept and random slope model to fit the BPRSL data.
#Create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
#Print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
#Perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimates of the fixed effects are once again quite similar to those of the previous model (BPRS_ref), but the likelihood ratio test of the random intercept model against the random intercept and slope model gives a chi-squared statistic of 63.66 on 2 degrees of freedom (DF) with a very small associated p-value (p = 1.5 x 10^-14). This means that the random intercept and slope model provides a better fit to the data.
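The likelihood ratio test can also be computed by hand from the deviances reported above (2572.9 for the random intercept model, 2509.2 for the random intercept and slope model, a difference of 2 parameters); a self-contained sketch:

```r
# Likelihood ratio test by hand: the difference in deviance
# (-2 * log-likelihood) is approximately chi-squared distributed under
# the null, with df equal to the difference in number of parameters
chisq <- 2572.9 - 2509.2                             # ~63.7
p_value <- pchisq(chisq, df = 2, lower.tail = FALSE) # ~1.5e-14
```

This matches the Chisq and Pr(>Chisq) columns of the anova() output up to rounding of the deviances.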
Finally, we can fit a random intercept and slope model that allows for a treatment × week interaction.
#Create a random intercept and random slope model with the treatment x week interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2
## BPRS_ref2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
The likelihood ratio test of the interaction random intercept and slope model against the corresponding model without the interaction gives a chi-squared statistic of 1.78 on 1 DF; the associated p-value (0.182) is clearly above the 0.05 significance level, so the evidence for the interaction is weak. We nevertheless choose the interaction model, as it allows the rate of change in BPRS to differ between the treatment groups.
The estimated regression parameters for the interaction indicate that the BPRS slopes are higher for subjects in treatment 2 than in treatment 1: the week:treatment2 estimate is 0.72, with an approximate 95% confidence interval (CI) of (-0.35, 1.78). As the CI contains 0, this difference is not significant.
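The confidence interval above can be reproduced from the fixed-effects table (estimate 0.7158, standard error 0.5306 for week:treatment2); a self-contained sketch using the common estimate +/- 2 x SE approximation:

```r
# Approximate 95% CI for the week:treatment2 interaction
est <- 0.7158
se  <- 0.5306
ci  <- est + c(-2, 2) * se  # approximately (-0.35, 1.78); contains 0
```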
Next we can evaluate how well our model fits the BPRSL data by finding the fitted values from the interaction model and plotting the fitted BPRS values against the observed BPRS values.
#Draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) +
scale_y_continuous(name = "Observed BPRS") +
theme(legend.position = "top")
#Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
#Create a new column fitted to RATSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)
#Draw the plot of BPRSL with the Fitted values of BPRS
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 10, 1)) +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "top")
As we can see from the plots, the interaction model captures the overall pattern of the observed data quite well, although the fitted profiles are necessarily smoother and show less variability than the observed ones.
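Beyond the visual comparison, the agreement between observed and fitted values can be summarized numerically; a sketch assuming the BPRSL data with the Fitted column added above:

```r
# Correlation and root-mean-square error between observed and fitted
# BPRS values: a high correlation and low RMSE indicate a good fit
cor(BPRSL$bprs, BPRSL$Fitted)
sqrt(mean((BPRSL$bprs - BPRSL$Fitted)^2))
```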